I wanted to see if samples grouped by geographic location, and given that sites range widely in distance from one another, I decided to visually assess using a color scheme that at least roughly represented their location in 2D space. See the map below:
In the NMDS (Sørensen), we see definite clustering by Site, but not all groupings are distinct.
We also see evidence of potentially significant spatial relationships (i.e., distance-decay).
Background
Microbiome sample clustering can be performed using either model-based methods and machine learning methods.
- Machine learning methods, which rely on defined distance metrics, are used more frequently than model-based statistical methods (“due to their efficient implementation and easy interpretation.”)
- I used the partition around medoids (PAM) clustering method, which is related to but considered more robust than K-means. In contrast to K-means, which can be sensitive to the effects of outliers, PAM’s optimization goal is to minimize the sum of distances to the medoids instead of minimizing the sum of the squared distances to the cluster centers.
Note: clustering was performed directly on distance matrices, not ordinations or ordination scores
# Perform PAM clustering and save - Fun
pam_fwc_sor_k10 <- pam(Fun_wc_sorensen_distmat, k = 10, diss = T, cluster.only = T)
saveRDS(pam_fwc_sor_k10, file = "../processed_data/clean_rds_saves/pam/pam_fwc_sor_k10.rds")
# Perform PAM clustering and save - Bac
pam_bwc_sor_k14 <- pam(Bac_wc_sorensen_distmat, k = 14, diss = T, cluster.only = T)
pam_bwc_sor_k14<-readRDS(file = "../processed_data/clean_rds_saves/pam/pam_bwc_sor_k14.rds")
saveRDS(pam_bwc_sor_k14, file = "../processed_data/clean_rds_saves/pam/pam_bwc_sor_k14.rds")
# add bac pam to bac sd
sample_data(Bac_wholecommunity)$bac_clus_sor_k10_og <- pam_bwc_sor_k14
sample_data(Bac_wholecommunity)$bac_clus_sor_k10_og <- as.factor(sample_data(Bac_wholecommunity)$bac_clus_sor_k10_og)
# add fun pam to bac sd
sample_data(Bac_wholecommunity)$Fun_sor_clusters <- sample_data(Fun_wholecommunity)$Fun_sor_clusters
sample_data(Bac_wholecommunity)$Fun_sor_clusters <- as.factor(sample_data(Bac_wholecommunity)$Fun_sor_clusters)
saveRDS(Bac_wholecommunity,file="../processed_data/clean_rds_saves/Bac_wholecommunity.rds")
# tail(colnames(sample_data(Bac_wholecommunity)))
# tail(sample_data(Fun_wholecommunity)[1:8])
# add bac pam to fun sd
sample_data(Fun_wholecommunity)$bac_clus_sor_k10_og <- as.factor(sample_data(Bac_wholecommunity)$bac_clus_sor_k10_og)
tail(colnames(sample_data(Fun_wholecommunity)))
saveRDS(Fun_wholecommunity,file="../processed_data/clean_rds_saves/Fun_wholecommunity.rds")
Note the density of cluster 1 - I’ll investigate that further.
We do see clusters with only 1 site and others with multiple sites.
Fungal cluster descriptions
Clusters not yet relabled here…
| Cluster | total n | Site(s) | Grass(es) | Characteristics† | Exclusivity | |
|---|---|---|---|---|---|---|
| 1 | 183 | spans pH range | ||||
| 2 | 37 | BNP,DMT,FMT,SEV | BOER (n=33), BOGR (n=4) | high pH* (6.8 - 8.3, mean 7.5) | ||
| 3 | 56 | high pH**** (6.8 - 8.3, mean 7.8) | ||||
| 4 | 17 | lower pH**** (5.6 - 7.6, mean 6.2) | ||||
| 5 | 43 | high pH** (7.1 - 7.8, mean 7.6) | ||||
| 6 | 32 | lower pH**** (5.1 - 7.8, mean 6.2) | ||||
| 7 | 38 | |||||
| 8 | 29 | mostly LAR (n=27) | high pH**** (7.2 - 8.2, mean 8.0) | |||
| 9 | 9 | SFA | SCSC (only grass present) | Site=SFA | ||
| 10 | 29 | KAE | lower pH**** (5.9 - 7.2, mean 6.3) | Site=KAE (of the 32 KAE samples, only 3 others were in diff clusters) |
randomForest)Clusters: Sørensen dissimilarity clusters based on pam (k = 10)
Method: R package randomForest v4.7.1.1
predictors.all<-t(otu_table(Fun_wholecommunity))
response.Fun_sor_clusters<-as.factor(sample_data(Fun_wholecommunity)$Fun_sor_clusters)
rf.data.Fun_sor_clusters<-data.frame(response.Fun_sor_clusters, predictors.all)
classify.Fun_sor_clusters<-randomForest(response.Fun_sor_clusters~., data = rf.data.Fun_sor_clusters, ntree=999)
Call: randomForest(formula = response.Fun_sor_clusters ~ ., data = rf.data.Fun_sor_clusters, ntree = 999)
Type of random forest: classification
Number of trees: 999
No. of variables tried at each split: 81
OOB estimate of error rate: 6.82%
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Cluster error % | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| F1 | 183 | 0.0 | |||||||||
| F2 | 34 | 3 | 8.1 | ||||||||
| F3 | 5 | 50 | 1 | 10.7 | |||||||
| F4 | 12 | 4 | 1 | 29.4 | |||||||
| F5 | 1 | 50 | 1 | 1 | 5.7 | ||||||
| F6 | 7 | 23 | 2 | 28.1 | |||||||
| F7 | 1 | 2 | 34 | 1 | 10.5 | ||||||
| F8 | 1 | 27 | 1 | 6.9 | |||||||
| F9 | 10 | 0 | |||||||||
| F10 | 1 | 28 | 3.4 |
Notes: OTU205 is highest in relative abundance in Cluster 1, most other clusters have minimal or no presence. Exception is sample in Cluster 7, which was the one misclassified into Cluster 1.
OTU6679, Generalist.test: all (global RA = 1.65%) v. Cluster 1 (mean RA = 3.75%)
Fungal whole community
Clusters 2 & 3 seem to be the outgroup / most distantly related
Clusters 1 & 9 are very similar, followed by 10 -> wrap-around cluster numbering
Which OTUs are driving the differences between clusters 1 & 2?
not updated yet
average - Species contribution to average between-group dissimilarity
cusum - Ordered cumulative contribution (to between-group dissimilarity). These are based on item average, but they sum up to total 1
p - Permutation \(p\)-value. Probability of getting a larger or equal average contribution in random permutation of the group factor
Mean Sørensen dissimilarity between Clusters 1 & 2: 0.920 (unweighted) / 0.903 (weighted by sample size)
Mean within-cluster similarities
- Cluster 1: 0.514
- Cluster 2: 0.802
ppt3yr = mean annual precipitation (determined over a 3-year window)
GDD3yr = growing degree days (determined over a 3-year window)
avg_SRL = specific root length
avg_SLA = specific leaf area
herbivory_perc = average site level damage estimate for herbivory, averaged over all species and individuals at the site; could be used to indicate herbivory pressure at the site level - ranges from 1:17 (%)